Paspaley make the best pearl jewelry in the world. Unfortunately, it's priced accordingly, so actual Paspaley purchases are somewhat rare for me, and I wondered if we could do some analysis instead. So I scraped the website on Boxing Day 2017, again on Boxing Day 2018, and again on Boxing Day 2019, which means we can have a bit of a look at what's going on.
To be respectful of Paspaley's content, I haven't included the scraped HTML files here, but if you look at the Paspaley website and think there could be something in the HTML that I haven't taken advantage of, then let me know and I can update the scripts.
There was an update to the Paspaley website at some point in 2018, when they moved away from Shopify. I'll focus on the most recent version of the website here, but the overall process is similar for both years. We don't want to affect the website's ability to serve other visitors, so the first step when acquiring data from a website is to check for an API. In this case, it doesn't look like they have one. The next step is to look at the relevant robots.txt file in case it provides some guidance: https://www.paspaley.com/robots.txt. In this case it doesn't give us much information, but it helpfully provides a link to the XML sitemap: https://www.paspaley.com/pub/sitemap.xml.
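As a quick sketch of that robots.txt step: the file is just plain text, so we can pull out any Sitemap declarations with base R. In practice you'd fetch the live file with readLines("https://www.paspaley.com/robots.txt"); here I've inlined a made-up example so the snippet is self-contained.

```r
# An inline stand-in for the real robots.txt (the Disallow line is invented
# for illustration; the Sitemap line matches what the live file points to).
robots <- c(
  "User-agent: *",
  "Disallow: /checkout/",
  "Sitemap: https://www.paspaley.com/pub/sitemap.xml"
)

# Keep only lines that declare a sitemap, then strip the field name.
sitemap_lines <- grep("^Sitemap:", robots, value = TRUE)
sitemap_url <- trimws(sub("^Sitemap:", "", sitemap_lines))
sitemap_url
#> [1] "https://www.paspaley.com/pub/sitemap.xml"
```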
Screenshot of the XML sitemap that Paspaley makes available.
That gives us a nice list of the products on the website and their addresses, so we can get details such as prices. We can use the rvest package to quickly grab the relevant information from this file.
library(rvest)
library(tidyverse)
library(xml2) # xml_find_all() comes from xml2

content <- read_html("inputs/2018/sitemap.xml")

# Each product is within a url tag. Instead of HTML/CSS selectors here we're
# using XPath. It's fairly similar, just a slightly different syntax.
each_product <- content %>%
  xml_find_all("//url")

# Thanks to Jenny Bryan https://github.com/jennybc/manipulate-xml-with-purrr-dplyr-tidyr
# One row per product: the row number makes it easier to go back and check,
# and the relevant XML nodes for each product go into a list-column.
all_products <- tibble(row = seq_along(each_product),
                       nodeset = each_product)

# Thanks to James Goldie https://rensa.co/writing/drilling-into-non-rectangular-data-with-purrr/
all_products <- all_products %>%
  # First get the URL for each product
  mutate(product_link = nodeset %>%
           html_node("loc") %>%
           html_text(trim = TRUE)) %>%
  # There's a bunch of URLs that aren't products, so we test for whether
  # there's a name first
  mutate(product_name_existence_test = nodeset %>%
           html_node("image")) %>%
  # Drop it if not a product
  filter(!is.na(product_name_existence_test)) %>%
  # Grab the name of the product
  mutate(product_name = nodeset %>%
           html_node("image") %>%
           html_node("title") %>%
           html_text(trim = TRUE)) %>%
  select(-product_name_existence_test, -nodeset)

write_csv(all_products, "outputs/misc/2018_links_from_xml.csv")
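If you want to sanity-check the XPath logic before running it over the full sitemap, you can try it on a toy XML document first. This is a minimal sketch: the structure mirrors a sitemap entry (a loc tag, and an image/title pair for products), but the URLs and names are invented, and I've omitted the namespace declaration that the real file carries so the plain XPath works with xml2 directly. Note that filtering to products can also be done inside the XPath itself, with a predicate.

```r
library(xml2)

# A toy sitemap: one url entry with an image (a product), one without.
sitemap <- read_xml('
<urlset>
  <url>
    <loc>https://example.com/product-a</loc>
    <image><title>Product A</title></image>
  </url>
  <url>
    <loc>https://example.com/about-us</loc>
  </url>
</urlset>')

# Only url entries that contain an image child are products, so we can
# filter in the XPath predicate rather than afterwards.
products <- xml_find_all(sitemap, "//url[image]")

product_link <- xml_text(xml_find_first(products, "loc"), trim = TRUE)
product_name <- xml_text(xml_find_first(products, "image/title"), trim = TRUE)
```

The about-us entry is dropped by the predicate, leaving one link ("https://example.com/product-a") and one name ("Product A").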
Although the sitemap should give us everything, it's worth seeing whether a different method turns up any information that's missing. We'll also use this alternative information to allocate collections and categories to the products.
Looking at the website, Paspaley organises the products by category (e.g. bracelet, earrings, etc).
Screenshot of the categories page.
It also categorises them by collection (e.g. Monsoon, Rockpool, etc).
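Scraping a category or collection page with rvest follows the same pattern as the sitemap: read the page, select the nodes that hold product names, and extract the text. The sketch below runs on an inline fragment of HTML; the class names ("product-item", "product-name") and the product names are made up for illustration, and the real selectors would come from inspecting Paspaley's category pages.

```r
library(rvest)

# A stand-in for a category page, built inline so the example is
# self-contained (minimal_html() wraps a fragment in a full document).
category_page <- minimal_html('
  <div class="product-item"><a class="product-name" href="/p/a">Monsoon Ring</a></div>
  <div class="product-item"><a class="product-name" href="/p/b">Rockpool Pendant</a></div>
')

# Select every product-name link on the page and pull out the text.
product_names <- category_page %>%
  html_nodes(".product-name") %>%
  html_text(trim = TRUE)
```

Repeating this for each category (and each collection) page gives a product-to-category mapping that can then be joined onto the sitemap data by product name or link.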